Joining Massive High-Dimensional Datasets

نویسندگان

  • Tamer Kahveci
  • Christian A. Lang
  • Ambuj K. Singh
چکیده

We consider the problem of joining massive datasets. We propose two techniques for minimizing disk I/O cost of join operations for both spatial and sequence data. Our techniques optimize the available buffer space using a global view of the datasets. We build a boolean matrix on the pages of the given datasets using a lower bounding distance predictor. The marked entries of this matrix represent candidate page pairs to be joined. Our first technique joins the marked pages iteratively. Our second technique clusters the marked entries using rectangular dense regions that have minimal perimeter and fit into buffer. These clusters are then ordered so that the total number of common pages between consecutive clusters is maximal. The clusters are then read from disk and joined. Our experimental results on various real datasets show that our techniques are 2 to 86 times faster than the competing techniques for spatial datasets, and 13 to 133 times faster than the competing techniques for sequence datasets.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Mammalian Eye Gene Expression Using Support Vector Regression to Evaluate a Strategy for Detecting Human Eye Disease

Background and purpose: Machine learning is a class of modern and strong tools that can solve many important problems that nowadays humans may be faced with. Support vector regression (SVR) is a way to build a regression model which is an incredible member of the machine learning family. SVR has been proven to be an effective tool in real-value function estimation. As a supervised-learning appr...

متن کامل

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

متن کامل

یک روش مبتنی بر خوشه‌بندی سلسله‌مراتبی تقسیم‌کننده جهت شاخص‌گذاری اطلاعات تصویری

It is conventional to use multi-dimensional indexing structures to accelerate search operations in content-based image retrieval systems. Many efforts have been done in order to develop multi-dimensional indexing structures so far. In most practical applications of image retrieval, high-dimensional feature vectors are required, but current multi-dimensional indexing structures lose their effici...

متن کامل

Spherically Symmetric Solutions in a New Braneworld Massive Gravity Theory

In this paper, a combination of the braneworld scenario and covariant de Rham-Gabadadze-Tolley (dRGT) massive Gravity theory is proposed. In this setup, the five-dimensional bulk graviton is considered to be massive. The five dimensional nonlinear ghost-free massive gravity theory affects the 3-brane dynamics and the gravitational potential on the brane. Following the solutions with spherical s...

متن کامل

Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. “Structure” can be understood as symmetry and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants, that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003